Skip to content

Conversation

@igaw
Copy link
Collaborator

@igaw igaw commented Feb 7, 2025

These changes broke the nightly CI runs. We were unable to figure out in time what it causes the fail, thus revert these changes.

Introduced here with #2665 to address #2662

Fixes: #2668

@igaw
Copy link
Collaborator Author

igaw commented Feb 7, 2025

The nightly CI still fails: https://github.com/linux-nvme/nvme-cli/actions/runs/13201431769

I've checked if we don't hunt a ghost, that is around 3 weeks ago the 6.14 bits have been merged into Linus' tree. I thought we are following this tree. Anyway, I didn't spot any changes which could explain this problem.

@ikegami-t
Copy link
Contributor

Agreed. Thank you.

@igaw
Copy link
Collaborator Author

igaw commented Feb 7, 2025

Something is a still off. I hoped a revert would help but the test run of the nightly CI doesn't pass.

@ikegami-t
Copy link
Contributor

Can we retest the CI with the commit cccefc4 that pushed before the blkdev changes and also the TP changes not applied?

tokunori@tokunori-desktop:~/nvme-cli$ git log --oneline -2 565b30e2
565b30e2 nvme: use blkdev for direct if namespace-id specified
cccefc48 argconfig: add argconfig_get_value()

@MaisenbacherD
Copy link
Contributor

Can we retest the CI with the commit cccefc4 that pushed before the blkdev changes and also the TP changes not applied?

tokunori@tokunori-desktop:~/nvme-cli$ git log --oneline -2 565b30e2
565b30e2 nvme: use blkdev for direct if namespace-id specified
cccefc48 argconfig: add argconfig_get_value()

I just tested this, but the nvme_create_max_ns_test fails starting on commit 565b30e. I am also getting a test fail for the revert branch on Daniel's fork. Already looking into this issue and see what is going on :)

@ikegami-t
Copy link
Contributor

Noted so looks the test fail on the revert branch caused by the TP changes. Thank you.

@MaisenbacherD
Copy link
Contributor

After looking into this issue, it turns out that 565b30e (nvme: use blkdev for direct if namespace-id specified)
initially caused the issue discovered by nvme_create_max_ns_test.
With this commit, devices are opened by the namespace (blkdev) if already existing rather the controller (chardev) and assigns it to the nvme_dev struct.

When trying to detach(/attach) a new namespace, nvme_attach_ns errors out if the given nvme_dev is a blkdev:

nvme-cli/nvme.c

Lines 2955 to 2959 in 724836d

if (is_blkdev(dev)) {
nvme_show_error("%s: a block device opened (dev: %s, nsid: %d)", cmd->name,
dev->name, cfg.namespace_id);
return -EINVAL;
}

So reverting this change also makes sense to me until we come up with a version that does not break this scenario.
@ikegami could you maybe explain what you tried to fix with commit 565b30e? :)

Reverting just the commits of #2665 was not enough to let the tests pass as the test cases were also modified, which uncovered some more errors.
f54f046 introduced checking for the correct ns after attaching and was refined in 56cf51b.
Previously we only checked for the presence of n1 also when attaching namespaces > 1:

nvme-cli/tests/nvme_test.py

Lines 407 to 414 in 1024886

err = subprocess.call(attach_ns_cmd,
shell=True,
stdout=subprocess.DEVNULL)
if err == 0:
# enumerate new namespace block device
self.nvme_reset_ctrl()
# check if new namespace block device exists
err = 0 if stat.S_ISBLK(os.stat(self.ns1).st_mode) else 1

Somehow the block device for ns > 1 does not appear in the device tree waiting for it after attaching it to the controller.
I will still look a bit further into this issue as I have not found the solution yet. :)

@ikegami-t
Copy link
Contributor

So reverting this change also makes sense to me until we come up with a version that does not break this scenario.

Yes I can agree with the revert.

@ikegami could you maybe explain what you tried to fix with commit 565b30e? :)

The commit was to resolve the issue #2662 by using the blkdev instead of the chardev.

Note: The PR #2671 changes are to revert the blkdev changes partially to use the blkdev only for IO commands. If you have any time or chance to retest the PR changes also as if the test fail resolved it will be helpful for me to investigate the cause. But sorry not sure if the changes can be merged actually even if the works correctly.

@igaw
Copy link
Collaborator Author

igaw commented Feb 10, 2025

The whole open device logic is a convoluted and very hard to understand. I think it's time to redesign it. That's why I haven't pulled #2671, it just makes it even more complex.

@ikegami
Copy link

ikegami commented Feb 10, 2025

@ikegami could you maybe explain what you tried to fix with commit 565b30e? :)

I am not Tokunori Ikegami [email protected]

@MaisenbacherD
Copy link
Contributor

@ikegami could you maybe explain what you tried to fix with commit 565b30e? :)

The commit was to resolve the issue #2662 by using the blkdev instead of the chardev.

I see. Thanks :)

The whole open device logic is a convoluted and very hard to understand. I think it's time to redesign it. That's why I haven't pulled #2671, it just makes it even more complex.

Sounds good.

@MaisenbacherD
Copy link
Contributor

@ikegami could you maybe explain what you tried to fix with commit 565b30e? :)

I am not Tokunori Ikegami [email protected]

I am so sorry. This probably isn't the first time this happened to you. I will pay attention next time :)

@MaisenbacherD
Copy link
Contributor

The failing run-nightly-tests GitHub action is caused by a missing bind-mount option of the /dev directory.
Without the 'shared' bind-propagation the device tree is not reliably updated within the container.
This causes newly created namespaces not to show up.
The following line

options: '--privileged -e BDEV0'

would need to be changed to something like:

      options: '--privileged -v "/dev":"/dev":z -e BDEV0'

@igaw Do you want to add this change in this PR, or should I send a new PR for this?

@igaw
Copy link
Collaborator Author

igaw commented Feb 11, 2025

Wow, that's a nasty one to debug! Great catch! Let me rebase this PR and add your change to it.

igaw and others added 4 commits February 11, 2025 12:06
This reverts commit 946029c.

Causes regression in nightly CI runs. The problem couldn't be figured out
thus revert it.

Signed-off-by: Daniel Wagner <[email protected]>
This reverts commit 565b30e.

Causes regression in nightly CI runs. The problem couldn't be figured out
thus revert it.

Signed-off-by: Daniel Wagner <[email protected]>
This reverts commit cccefc4.

Causes regression in nightly CI runs. The problem couldn't be figured out
thus revert it.

Signed-off-by: Daniel Wagner <[email protected]>
The failing run-nightly-tests GitHub action is caused by a missing
bind-mount option of the /dev directory. Without the 'shared'
bind-propagation the device tree is not reliably updated within the
container. This causes newly created namespaces not to show up.

Signed-off-by: Dennis Maisenbacher <[email protected]>
@igaw
Copy link
Collaborator Author

igaw commented Feb 11, 2025

The test run with nightly CI action was successful!

Although this was a nasty one to figure out, I am sure the improvements in the test framework (polling etc) is a great step forward. Again thanks everyone!

@igaw igaw merged commit 1627604 into linux-nvme:master Feb 11, 2025
18 checks passed
@igaw igaw deleted the revert branch February 11, 2025 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

attach/detach command line interface changes broke nightly tests

4 participants